Introduction to Data Mining

https://taudata.blogspot.com/p/applied-data-mining-adm.html

Supervised Learning - Classification 02

https://taudata.blogspot.com/2022/04/slcm-03.html

(C) Taufik Sutanto

Notes and Disclaimer¶

  • This notebook is part of the free (open knowledge) eLearning course at: https://tau-data.id
  • Some images are taken from other resources; we respect the ownership of those images and provide a reference/citation to where they originated. Nevertheless, we sometimes have trouble finding the origin of an image. If you are the owner of an image and would like it taken out of this open-knowledge course (or want its citation revised), please contact us here with the details: https://tau-data.id/contact/
  • Unless stated otherwise, tau-data generally permits its resources to be copied and/or modified for non-commercial purposes, provided that proper acknowledgement/citation is given.

Outline:¶

  • Review of the Previous Material
  • Support Vector Machines
  • Neural Networks
  • Ensemble Models
  • The Imbalanced Data Problem
In [1]:
# Import the modules for this notebook
import warnings; warnings.simplefilter('ignore')
import numpy as np, matplotlib.pyplot as plt, pandas as pd, seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn import svm, preprocessing
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_blobs, make_moons, make_circles, make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection
from collections import Counter

sns.set(style="ticks", color_codes=True)

"Done"
Out[1]:
'Done'

k-Nearest Neighbour¶

Logistic Regression¶

Decision Tree Theory: Information Theory¶

Naive Bayes Classifier¶

  • P(x) is constant, so it can be ignored.
  • Its strongest assumption is independence between the predictor variables (hence "Naive").
  • Classification is done by computing the probability of each category given the data x = (x1,x2,...,xm).
  • For large data an out-of-core approach (partial fit) can be used:
    http://scikit-learn.org/stable/modules/scaling_strategies.html#scaling-strategies
  • NBC variants differ in how P(c|x) is computed, e.g. with a Gaussian (Normal) distribution - often called Gaussian Naive Bayes (GNB):
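The bullets above can be sketched in a few lines; a minimal illustration with synthetic two-class Gaussian data (the data and all names here are illustrative assumptions, not part of the course code), also showing the out-of-core `partial_fit` interface:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic data: two Gaussian classes in 2 dimensions
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
idx = rng.permutation(100)
X, y = X[idx], y[idx]

gnb = GaussianNB()
# partial_fit = the out-of-core approach: feed the data chunk by chunk
gnb.partial_fit(X[:50], y[:50], classes=[0, 1])
gnb.partial_fit(X[50:], y[50:])
print(gnb.predict([[0.0, 0.0], [4.0, 4.0]]))  # [0 1]
```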

Support Vector Machine (SVM)¶

Suppose the data is given as $\{(\bar{x}_1,y_1),...,(\bar{x}_n,y_n)\}$, where $\bar{x}_i$ is the input pattern of the $i^{th}$ observation and $y_i$ is its desired target value. The categories (classes) are represented by $y_i \in \{-1,1\}$. A hyperplane separating the two classes ("linearly separable") is: $$ \bar{w}'\bar{x}+b=0 $$ where $\bar{x}$ is the input vector (predictors), $\bar{w}$ the weight vector, and $b$ the bias.

The SVM Model (Hard Margin):¶

  • Let **Xo** be a vector on the plane (line) _wX + b = -1_.
  • Let **r** be the distance between the two margin planes.
  • Since **X** lies on the plane _wX + b = 1_, we can write _X = Xo + r w/||w||_
    (see the figure: *w* is perpendicular to the plane _wX + b = 0_, and _w/||w||_ is its unit vector).
  • So _wX + b = 1_ can be written as _w(Xo + r w/||w||) + b = 1_
  • i.e. _wXo + b + r||w||²/||w|| = 1_ ==> _wXo + b = 1 - r||w||_ ==> _-1 = 1 - r||w||_
  • which gives $r = \frac{2}{||w||}$
  • Conclusion: the optimal hyperplane is obtained by maximizing $\frac{2}{||w||}$, or equivalently by $\min \frac{||w||}{2}$
  • More details here: https://nlp.stanford.edu/IR-book/html/htmledition/support-vector-machines-the-linearly-separable-case-1.html
  • What is the effect of outliers on this model?
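The result $r = 2/||w||$ can be checked numerically; a minimal sketch on the same kind of separable blobs used later in this notebook (a very large C approximates the hard margin):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=20, centers=2, random_state=6)  # separable points
clf = SVC(kernel='linear', C=1e6).fit(X, y)                 # ~hard margin

w = clf.coef_[0]
print('margin r =', 2 / np.linalg.norm(w))  # r = 2/||w||

# The support vectors lie on the planes wX + b = +-1 (up to solver tolerance)
print(np.round(clf.decision_function(clf.support_vectors_), 3))
```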

Support Vector Machine: Soft Margin¶

  • Is the effect of outliers still the same in this model? How does it relate to the value of C?

Larger C ==> lower tolerance to outliers, and vice versa¶
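This relationship can be probed empirically; a minimal sketch on overlapping synthetic blobs (the data and C values are illustrative assumptions): a small C gives a wide margin that tolerates many violations, so many points become support vectors, while a large C shrinks the margin.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Overlapping classes, so the soft margin actually matters
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)

for C in (0.01, 1, 100):
    n_sv = SVC(kernel='linear', C=C).fit(X, y).n_support_.sum()
    print(f'C = {C:>6} -> number of support vectors: {n_sv}')
```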

Dual dan Quadratic solver¶

  • The optimization above is usually solved via its Dual form.
  • The optimal parameters are then approximated with a Quadratic Programming solver.
  • Note that the objective function is convex ==> it has a global minimum.
  • The optimal solution depends only on the data points on the margin (the support vectors), so it can be more efficient (once the SVs are known).
  • The SVs can also be used to analyze the "Error Bound": http://www.svms.org/vc-dimension/
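The point that the solution depends only on the support vectors can be demonstrated directly: refitting on just the SVs reproduces the same hyperplane. A minimal sketch (the data and C here are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=20, centers=2, random_state=6)  # separable points
clf = SVC(kernel='linear', C=1e6).fit(X, y)

# Refit using only the support vectors: same w and b (up to tolerance)
sv = clf.support_
clf_sv = SVC(kernel='linear', C=1e6).fit(X[sv], y[sv])
print(np.allclose(clf.coef_, clf_sv.coef_, atol=1e-2),
      np.allclose(clf.intercept_, clf_sv.intercept_, atol=1e-2))
```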

Interpretation¶

  • The Recursive Feature Elimination (RFE) method: https://link.springer.com/content/pdf/10.1023/A:1012487302797.pdf
  • Look at the squared value of each component of w (higher = more important).
  • Be careful: some discussions on the internet claim that the sign (+/-) indicates each variable's importance, but this is not always true and can be disproved with a simple counter-example.
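A minimal sketch of both ideas on the binary iris task used later in this notebook (loaded here through sklearn so the block is self-contained; the large C mirrors the fit below, and the details are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data[:100], iris.target[:100]  # setosa vs versicolor

clf = SVC(kernel='linear', C=1e5).fit(X, y)
print('w^2 per feature:', np.round(clf.coef_[0] ** 2, 3))

# RFE repeatedly drops the feature with the smallest squared weight
rfe = RFE(SVC(kernel='linear', C=1e5), n_features_to_select=2).fit(X, y)
print('kept (ranking 1):',
      [n for n, r in zip(iris.feature_names, rfe.ranking_) if r == 1])
```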

What about categorical data?¶

  • Same as (logistic) regression ==> dummy (indicator) variables.
  • E.g. X1 = {a,b,c} ==> X1_a = [1,0,0], X1_b = [0,1,0], X1_c = [0,0,1]
  • https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
In [2]:
# Example
df = pd.DataFrame({'X1': ['a', 'b', 'a','c','a'],'X2': [1, 2, 3, 2, 1]})
df.head()
Out[2]:
X1 X2
0 a 1
1 b 2
2 a 3
3 c 2
4 a 1
In [3]:
df = pd.get_dummies(df)
df.head()
Out[3]:
X2 X1_a X1_b X1_c
0 1 True False False
1 2 False True False
2 3 True False False
3 2 False False True
4 1 True False False

Data Normalization/Standardization¶

  • As with (logistic) regression, the predictors/features of an SVM model need to be standardized/normalized.
  • http://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range
  • Careful: standardize the data only after outliers have been handled properly.
In [4]:
scaler = preprocessing.StandardScaler(with_mean=True, with_std=True)
df['X2'] = scaler.fit_transform(df[['X2']])
df
Out[4]:
X2 X1_a X1_b X1_c
0 -1.069045 True False False
1 0.267261 False True False
2 1.603567 True False False
3 0.267261 False False True
4 -1.069045 True False False
In [5]:
# Example: plotting the optimal hyperplane
# http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html#example-svm-plot-separating-hyperplane-py

X, y = make_blobs(n_samples=20, centers=2, random_state=6) # we create 20 separable points
clf = svm.SVC(kernel='linear', C=1000) # fit the model, don't regularize for illustration purposes
clf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
ax = plt.gca();xlim = ax.get_xlim(); ylim = ax.get_ylim()

# create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30);yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,linestyles=['--', '-', '--'])# plot decision boundary and margins
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,linewidth=1, facecolors='none', edgecolors='k')# plot support vectors
plt.show()
[Plot: the 20 points, the separating hyperplane with its margins, and the circled support vectors]

SVM Kernel (trick)

Definition of a Kernel Function¶

  • If for all $\bar{x},\bar{z} \in X$ it holds that
$$\kappa (\bar{x},\bar{z})=<\phi (\bar{x}),\phi (\bar{z})>$$

then $\kappa$ is called a kernel function (and $\phi$ the feature map).

  • Note that the result of the kernel mapping is a scalar (an inner product).
  • Kernels are used in SVM (and in other DM/ML models that can be expressed in inner products).
  • Note that most SVM formulations are stated in terms of inner products (i.e. w.x).
  • See here for more details: https://nlp.stanford.edu/IR-book/html/htmledition/nonlinear-svms-1.html

Example: the Lagrangian (Wolfe) Dual of the optimization above

Example 1¶

  • Let $X\subseteq \Re^2$ and $\phi : \bar{x}=(x_1,x_2)\rightarrow \phi (\bar{x})=(x_1^2,x_2^2,\sqrt{2}x_1x_2)\in F=\Re^3$.

  • Then

$<\phi(\bar{x}),\phi(\bar{z})>$
$=<(x_1^2,x_2^2,\sqrt{2}x_1x_2),(z_1^2,z_2^2,\sqrt{2}z_1z_2)>$
$=x_1^2z_1^2+x_2^2z_2^2+2x_1x_2z_1z_2$
$=(x_1z_1+x_2z_2)^2=<\bar{x},\bar{z}>^2$

  • Thus $\kappa(\bar{x},\bar{z})=<\bar{x},\bar{z}>^2$ is a kernel function and $F$ is its feature space.

Example 2¶

  • Let x = (x1, x2, x3); y = (y1, y2, y3),
  • with the feature map f(x) = (x1², x1x2, x1x3, x2x1, x2², x2x3, x3x1, x3x2, x3²),
  • then the kernel is K(x, y) = <f(x), f(y)> = <x, y>².
  • Numerical example: let x = (1, 2, 3) and y = (4, 5, 6). Then:
  • f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
    f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
  • <f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024
  • Complicated!... With the kernel function the computation simplifies to:
  • K(x, y) = (4 + 10 + 18)² = 32² = 1024
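Example 2 can be verified in a few lines of numpy (the `phi` helper is just the explicit feature map above, written with `np.outer`):

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

phi = lambda v: np.outer(v, v).ravel()  # all degree-2 monomials v_i * v_j
explicit = phi(x) @ phi(y)              # inner product in the 9-D feature space
kernel = (x @ y) ** 2                   # kernel trick: <x, y>^2

print(explicit, kernel)  # 1024 1024
```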

Well-Known Kernel Functions

SVM: Binary to Multiclass

Pros

  • Good accuracy
  • Works well on relatively small samples
  • Depends only on the SVs ==> improves efficiency
  • Convex ==> global minimum ==> guaranteed convergence

Cons

  • Inefficient for large data
  • Accuracy is sometimes low for multiclass problems (it is hard to capture the relationships between categories in the model)
  • Not robust to noise
In [6]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd, seaborn as sns

# load the iris data
df = sns.load_dataset("iris")
g = sns.pairplot(df, hue="species")
[Plot: pairplot of the iris features, colored by species]
In [7]:
df.sample(7)
Out[7]:
sepal_length sepal_width petal_length petal_width species
32 5.2 4.1 1.5 0.1 setosa
66 5.6 3.0 4.5 1.5 versicolor
18 5.7 3.8 1.7 0.3 setosa
118 7.7 2.6 6.9 2.3 virginica
83 6.0 2.7 5.1 1.6 versicolor
135 7.7 3.0 6.1 2.3 virginica
55 5.7 2.8 4.5 1.3 versicolor
In [8]:
df.describe(include= 'all')
Out[8]:
sepal_length sepal_width petal_length petal_width species
count 150.000000 150.000000 150.000000 150.000000 150
unique NaN NaN NaN NaN 3
top NaN NaN NaN NaN setosa
freq NaN NaN NaN NaN 50
mean 5.843333 3.057333 3.758000 1.199333 NaN
std 0.828066 0.435866 1.765298 0.762238 NaN
min 4.300000 2.000000 1.000000 0.100000 NaN
25% 5.100000 2.800000 1.600000 0.300000 NaN
50% 5.800000 3.000000 4.350000 1.300000 NaN
75% 6.400000 3.300000 5.100000 1.800000 NaN
max 7.900000 4.400000 6.900000 2.500000 NaN
In [9]:
# Separate the data
df2 = df[df['species'].isin(['versicolor', 'setosa'])] # take only 2 classes for a binary example

X = df2[['sepal_length','sepal_width','petal_length','petal_width']]
Y = df2['species']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.3)
print(X_train.shape, X_test.shape)
(70, 4) (30, 4)
In [10]:
Y
Out[10]:
0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
95    versicolor
96    versicolor
97    versicolor
98    versicolor
99    versicolor
Name: species, Length: 100, dtype: object
In [11]:
# Fit and evaluate the model
dSVM = svm.SVC(C = 10**5, kernel = 'linear')
dSVM.fit(X_train, Y_train)

y_SVM = dSVM.predict(X_test)
print('Akurasi = ', accuracy_score(Y_test, y_SVM))
print(confusion_matrix(Y_test, y_SVM))
print(classification_report(Y_test, y_SVM))
Akurasi =  1.0
[[17  0]
 [ 0 13]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        17
  versicolor       1.00      1.00      1.00        13

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

In [12]:
# The Support Vectors
print('index dr SV-nya: ', dSVM.support_)
print('Vector Datanya: \n', dSVM.support_vectors_)
index dr SV-nya:  [ 6 41 39]
Vector Datanya: 
 [[4.5 2.3 1.3 0.3]
 [5.1 3.3 1.7 0.5]
 [5.1 2.5 3.  1.1]]
In [13]:
# Model weights, for interpretation
print('w = ',dSVM.coef_)
print('b = ',dSVM.intercept_)
w =  [[ 0.04621298 -0.52129234  1.00306886  0.46413981]]
b =  [-1.45238036]
In [14]:
# Using kernels: http://scikit-learn.org/stable/modules/svm.html#svm-kernels
for kernel in ('sigmoid', 'poly', 'rbf'):
    dSVM = svm.SVC(kernel=kernel)
    dSVM.fit(X_train, Y_train)
    y_SVM = dSVM.predict(X_test)
    print(accuracy_score(Y_test, y_SVM))
0.43333333333333335
1.0
1.0
In [15]:
# Multiclass SVM example (with and without a kernel)
X = df[['sepal_length','sepal_width','petal_length','petal_width']]
Y = df['species'] # use all species (3 categories)

X = preprocessing.StandardScaler().fit_transform(X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.3)
print(X_train.shape, X_test.shape)
df.describe(include='all')
(105, 4) (45, 4)
Out[15]:
sepal_length sepal_width petal_length petal_width species
count 150.000000 150.000000 150.000000 150.000000 150
unique NaN NaN NaN NaN 3
top NaN NaN NaN NaN setosa
freq NaN NaN NaN NaN 50
mean 5.843333 3.057333 3.758000 1.199333 NaN
std 0.828066 0.435866 1.765298 0.762238 NaN
min 4.300000 2.000000 1.000000 0.100000 NaN
25% 5.100000 2.800000 1.600000 0.300000 NaN
50% 5.800000 3.000000 4.350000 1.300000 NaN
75% 6.400000 3.300000 5.100000 1.800000 NaN
max 7.900000 4.400000 6.900000 2.500000 NaN
In [16]:
set(Y_train)
Out[16]:
{'setosa', 'versicolor', 'virginica'}
In [17]:
# One Versus All: http://www.jmlr.org/papers/volume5/rifkin04a/rifkin04a.pdf
dSVM = svm.LinearSVC()
dSVM.fit(X_train, Y_train)
y_SVM = dSVM.predict(X_test)
print('Akurasi = ', accuracy_score(Y_test, y_SVM))
y_SVM
Akurasi =  0.9555555555555556
Out[17]:
array(['versicolor', 'versicolor', 'setosa', 'virginica', 'setosa',
       'versicolor', 'setosa', 'setosa', 'virginica', 'versicolor',
       'setosa', 'virginica', 'versicolor', 'versicolor', 'setosa',
       'virginica', 'versicolor', 'virginica', 'setosa', 'virginica',
       'versicolor', 'virginica', 'setosa', 'setosa', 'virginica',
       'virginica', 'versicolor', 'versicolor', 'setosa', 'virginica',
       'setosa', 'setosa', 'versicolor', 'versicolor', 'setosa', 'setosa',
       'versicolor', 'setosa', 'versicolor', 'setosa', 'setosa',
       'versicolor', 'setosa', 'virginica', 'versicolor'], dtype=object)
In [18]:
# There are 3 classifiers (as expected)
dSVM.coef_
Out[18]:
array([[-0.22672574,  0.41106345, -0.6761091 , -0.63330724],
       [ 0.01212648, -0.43968806,  0.74023292, -0.75194366],
       [-0.09711928, -0.43961137,  1.61658978,  1.39170709]])
In [19]:
# All At Once Method http://www.jmlr.org/papers/volume2/crammer01a/crammer01a.pdf
dSVM = svm.SVC(decision_function_shape='ovo')
dSVM.fit(X_train, Y_train)
y_SVM = dSVM.predict(X_test)
print('Akurasi = ', accuracy_score(Y_test, y_SVM))
y_SVM
Akurasi =  0.9333333333333333
Out[19]:
array(['versicolor', 'versicolor', 'setosa', 'virginica', 'setosa',
       'versicolor', 'setosa', 'setosa', 'versicolor', 'versicolor',
       'setosa', 'virginica', 'versicolor', 'virginica', 'setosa',
       'virginica', 'versicolor', 'virginica', 'setosa', 'virginica',
       'versicolor', 'virginica', 'setosa', 'setosa', 'virginica',
       'virginica', 'versicolor', 'versicolor', 'setosa', 'virginica',
       'setosa', 'setosa', 'versicolor', 'virginica', 'setosa', 'setosa',
       'versicolor', 'setosa', 'versicolor', 'setosa', 'setosa',
       'versicolor', 'setosa', 'virginica', 'versicolor'], dtype=object)

Artificial Neural Network

Toy-Data Example: Neural Network (Backpropagation)

Multiclass ANN¶

Given the mathematical formulation and the way a Neural Network works, do we also need to standardize the data, as with SVM and Logistic Regression?¶

Neural Network - Empirical Analysis of the ANN Parameters

https://goo.gl/3rcnc9

Why can linear functions form a curved boundary?

http://s.id/j6i

Neural Network VS Deep Learning¶

In [20]:
# Neural Network: http://scikit-learn.org/stable/modules/neural_networks_supervised.html
NN = MLPClassifier(hidden_layer_sizes=(100,)) # 1 hidden layer with 100 neurons
NN.fit(X_train, Y_train)
y_NN = NN.predict(X_test)
print('Akurasi = ', accuracy_score(Y_test, y_NN))
Akurasi =  0.9333333333333333

Inductive bias:¶

  • Parameter-estimation bias (statistics)
  • Inductive bias of the sample (Machine Learning - Tom Mitchell)
  • Inductive bias of the classifier choice (Statistical Learning Theory - Vapnik)
In [21]:
h, i = .02, 1  # mesh step size, subplot counter
names = ["Nearest Neighbors", "Logistic Regression", "Naive Bayes", "Linear SVM", "RBF SVM",
         "Decision Tree", "Random Forest", "Neural Net"]

classifiers = [KNeighborsClassifier(3),
    LogisticRegression(solver='lbfgs',multi_class='multinomial'),
    GaussianNB(), SVC(kernel="linear", C=0.025), SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1)]

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),make_circles(noise=0.2, factor=0.5, random_state=1),linearly_separable]
figure = plt.figure(figsize=(27, 9))

for ds_cnt, ds in enumerate(datasets):
    # preprocess dataset, split into training and test part
    X, y = ds
    X = preprocessing.StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    if ds_cnt == 0:
        ax.set_title("Input data")
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
               edgecolors='k')
    # Plot the testing points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
               edgecolors='k')
    ax.set_xlim(xx.min(), xx.max()); ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(()); ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max]x[y_min, y_max].
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,edgecolors='k')
        # Plot the testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,edgecolors='k', alpha=0.6)

        ax.set_xlim(xx.min(), xx.max());ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(()); ax.set_yticks(())
        if ds_cnt == 0:
            ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        i += 1

plt.tight_layout();plt.show()
[Plot: the three datasets and each classifier's decision boundary, annotated with its test accuracy]

Ensemble Model

  • What? Learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions.
  • Why? Better predictions and a more stable model.
  • How? Bagging & Boosting

“meta-algorithms”: Bagging & Boosting¶

  • Ensemble https://www.youtube.com/watch?v=Un9zObFjBH0
  • Bagging https://www.youtube.com/watch?v=2Mg8QD0F1dQ
  • Boosting https://www.youtube.com/watch?v=GM3CDQfQ4sw

AdaBoost

  • https://youtu.be/BoGNyWW9-mE?t=70
In [24]:
# Voting (bagging) example in Python
# Note: Random Forest is a bagging ensemble (albeit a modified one)

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
file = 'data/diabetes_data.csv'

try:
    # Local jupyter notebook, assuming "file" is in the "data" directory
    data = pd.read_csv(file, names=names).values # convert to a numpy array
except:
    # it's a google colab... create folder data and then download the file from github
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudataanalytics/Data-Mining--Penambangan-Data--Ganjil-2024/master/data/diabetes_data.csv
    data = pd.read_csv(file, names=names).values # convert to a numpy array

X, Y = data[:,0:8], data[:,8] # Slice
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.3)

kNN = KNeighborsClassifier(3)
kNN.fit(X_train, Y_train)
Y_kNN = kNN.score(X_test, Y_test)

DT = DecisionTreeClassifier(random_state=1)
DT.fit(X_train, Y_train)
Y_DT = DT.score(X_test, Y_test)

model = VotingClassifier(estimators=[('k-NN', kNN), ('Decision Tree', DT)], voting='hard')
model.fit(X_train,Y_train)
Y_Vot = model.score(X_test,Y_test)

print('Akurasi k-NN', Y_kNN)
print('Akurasi Decision Tree', Y_DT)
print('Akurasi Votting', Y_Vot)
Akurasi k-NN 0.696969696969697
Akurasi Decision Tree 0.70995670995671
Akurasi Votting 0.7142857142857143
In [25]:
# Averaging can also be used for classification (not only regression),
# but then we use the predicted probability of each category
T = DecisionTreeClassifier()
K = KNeighborsClassifier()
R= LogisticRegression()

T.fit(X_train,Y_train)
K.fit(X_train,Y_train)
R.fit(X_train,Y_train)

y_T=T.predict_proba(X_test)
y_K=K.predict_proba(X_test)
y_R=R.predict_proba(X_test)

Ave = (y_T+y_K+y_R)/3
print(Ave[:5]) # Print just first 5
prediction = [v.index(max(v)) for v in Ave.tolist()]
print(prediction[:5]) # Print just first 5
print('Akurasi Averaging', accuracy_score(Y_test, prediction))
[[0.27676107 0.72323893]
 [0.06742822 0.93257178]
 [0.29979175 0.70020825]
 [0.65073792 0.34926208]
 [0.98164276 0.01835724]]
[1, 1, 1, 0, 0]
Akurasi Averaging 0.7402597402597403
In [30]:
# AdaBoost
num_trees = 100
kfold = model_selection.KFold(n_splits=10, random_state=9, shuffle=True)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=1)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
0.746120984278879

Imbalanced Data

  • The metric trap
  • The accuracy on a particular category may be more important
  • Example cases
  • Undersampling
  • Oversampling
  • Model-based (weight adjustment)
  • https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
  • Comparison plots: https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/combine/plot_comparison_combine.html#sphx-glr-auto-examples-combine-plot-comparison-combine-py
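A minimal random-oversampling sketch with plain pandas + scikit-learn (a toy imbalanced frame for illustration; the imbalanced-learn package linked above provides richer methods such as SMOTE):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: 9 majority rows vs 3 minority rows
df = pd.DataFrame({'x': range(12), 'class': [0] * 9 + [1] * 3})

major = df[df['class'] == 0]
minor = df[df['class'] == 1]

# Oversampling: redraw the minority class with replacement up to the majority size
minor_up = resample(minor, replace=True, n_samples=len(major), random_state=0)
balanced = pd.concat([major, minor_up])
print(balanced['class'].value_counts().to_dict())
```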
In [31]:
data = pd.read_csv(file, names=names)
data.head()
Out[31]:
preg plas pres skin test mass pedi age class
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
In [32]:
plot = data["class"].value_counts().plot(kind='pie')
[Plot: pie chart of the class distribution]
In [33]:
# Example of model-based imbalance treatment - SVM
n_samples_1, n_samples_2 = 1000, 100
centers = [[0.0, 0.0], [2.0, 2.0]]
clusters_std = [1.5, 0.5]
X, y = make_blobs(n_samples=[n_samples_1, n_samples_2],centers=centers,cluster_std=clusters_std,random_state=0, shuffle=False)

# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# fit the model and get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10}) #WEIGHTED SVM
wclf.fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')# plot the samples
ax = plt.gca()# plot the decision functions for both classifiers
xlim = ax.get_xlim(); ylim = ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 30)# create grid to evaluate model
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)# get the separating hyperplane
a = ax.contour(XX, YY, Z, colors='k', levels=[0], alpha=0.5, linestyles=['-']) # plot decision boundary and margins
Z = wclf.decision_function(xy).reshape(XX.shape)# get the separating hyperplane for weighted classes
b = ax.contour(XX, YY, Z, colors='r', levels=[0], alpha=0.5, linestyles=['-'])# plot decision boundary and margins for weighted classes
plt.legend([a.collections[0], b.collections[0]], ["non weighted", "weighted"], loc="upper right")
plt.show()
[Plot: the samples with the non-weighted (black) and weighted (red) SVM decision boundaries]

Weighted Decision Tree¶

In [34]:
data = pd.read_csv(file, names=names).values # convert to a numpy array
X, Y = data[:,0:8], data[:,8] # Slice
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.3)

T = DecisionTreeClassifier(random_state = 0)
T.fit(X_train,Y_train)
y_DT = T.predict(X_test)
print('Akurasi  (Decision tree Biasa) = ', accuracy_score(Y_test, y_DT))
print(classification_report(Y_test, y_DT))

T = DecisionTreeClassifier(class_weight = 'balanced', random_state = 0)
T.fit(X_train,Y_train)
y_DT = T.predict(X_test)
print('Akurasi  (Weighted Decision tree) = ', accuracy_score(Y_test, y_DT))
print(classification_report(Y_test, y_DT))
Akurasi  (Decision tree Biasa) =  0.6883116883116883
              precision    recall  f1-score   support

         0.0       0.75      0.75      0.75       146
         1.0       0.58      0.58      0.58        85

    accuracy                           0.69       231
   macro avg       0.66      0.66      0.66       231
weighted avg       0.69      0.69      0.69       231

Akurasi  (Weighted Decision tree) =  0.7142857142857143
              precision    recall  f1-score   support

         0.0       0.77      0.77      0.77       146
         1.0       0.61      0.61      0.61        85

    accuracy                           0.71       231
   macro avg       0.69      0.69      0.69       231
weighted avg       0.71      0.71      0.71       231

End of Module